{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Recurrent memory intro\n",
    "\n",
    "In the seminar you'll deploy recurrent neural network inside SARSA agent.\n",
    "\n",
    "The environment it plays is a simple POMDP of rock-paper-scissors game with exploitable opponent.\n",
    "\n",
    "#### Instructions\n",
    "\n",
    "First, read through the code and __run it as you read__. The code will create a feedforward neural network and train it with SARSA.\n",
    "\n",
    "Since the game is partially observable, default algorithm will won't reach optimal score. In fact, it's unstable and may even end up worse than random.\n",
    "\n",
    "After you ran the code, __find the two ```#YOUR CODE HERE``` chunks__ (mb ctrl+f) and implement a recurrent memory.\n",
    "\n",
    "Re-run the experiment and compare the performance of feedworward vs recurrent agent. \n",
    "RNN should be _much_ better, session __reward > 50__.\n",
    "\n",
    "After you're done with that, proceed to the next part, for it is going to be much more interesting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt",
    "\n",
    "import numpy as np",
    "\n",
    "%matplotlib inline",
    "\n",
    "\n",
    "# number of parallel agents and batch sequence length (frames)",
    "\n",
    "N_AGENTS = 10",
    "\n",
    "SEQ_LENGTH = 25"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The environment we're going to use now is not a default gym env.\n",
    "\n",
    "It was instead written from scratch in `rockpaperscissors.py`.\n",
    "\n",
    "Morale: you can make your own gym environments easily with anything you want (including OS and the web, e.g. selenium)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym",
    "\n",
    "from rockpaperscissors import RockPaperScissors",
    "\n",
    "\n",
    "\n",
    "def make_env():",
    "\n",
    "    env = RockPaperScissors()",
    "\n",
    "    return gym.wrappers.TimeLimit(env, max_episode_steps=100)",
    "\n",
    "\n",
    "\n",
    "# spawn game instance",
    "\n",
    "env = make_env()",
    "\n",
    "observation_shape = env.observation_space.shape",
    "\n",
    "n_actions = env.action_space.n",
    "\n",
    "\n",
    "env.reset()",
    "\n",
    "obs = env.step(env.action_space.sample())[0]",
    "\n",
    "\n",
    "print obs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic agent setup\n",
    "Here we define a simple agent that maps game images into policy with a minimalistic neural net\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# setup theano/lasagne. Prefer CPU",
    "\n",
    "%env THEANO_FLAGS = device = cpu, floatX = float32",
    "\n",
    "\n",
    "import theano",
    "\n",
    "import lasagne",
    "\n",
    "import theano.tensor as T",
    "\n",
    "from lasagne.layers import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# observation",
    "\n",
    "obs = InputLayer((None,)+observation_shape,)",
    "\n",
    "\n",
    "nn = DenseLayer(obs, 32, nonlinearity=T.nnet.elu)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from agentnet.memory import RNNCell, GRUCell, LSTMCell",
    "\n",
    "# YOUR CODE HERE:",
    "\n",
    "# Implement a recurrent agent memory by un-comemnting code below and defining h_new",
    "\n",
    "\n",
    "#h_prev = InputLayer((None,50),name=\"previous memory state with 50 units\")",
    "\n",
    "\n",
    "# h_new = RNNCell(<what's prev state>,<what's input>,nonlinearity=T.nnet.elu)",
    "\n",
    "\n",
    "# (IMPORTANT!) use new cell to compute q-values instead of dense layer",
    "\n",
    "#nn = h_new"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from agentnet.resolver import EpsilonGreedyResolver",
    "\n",
    "l_qvalues = DenseLayer(nn, n_actions)",
    "\n",
    "l_actions = EpsilonGreedyResolver(l_qvalues)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Agent, as usual"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from agentnet.agent import Agent",
    "\n",
    "# YOUR CODE HERE",
    "\n",
    "# uncomment agent_states and define what layers should be used",
    "\n",
    "\n",
    "agent = Agent(observation_layers=obs,",
    "\n",
    "              policy_estimators=(l_qvalues),",
    "\n",
    "              # agent_states={<new rnn state>:<what layer should it become at next time-step>},",
    "\n",
    "              action_layers=l_actions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pool, as usual"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from agentnet.experiments.openai_gym.pool import EnvPool",
    "\n",
    "\n",
    "pool = EnvPool(agent, make_env, n_games=16)  # may need to adjust",
    "\n",
    "\n",
    "pool.update(SEQ_LENGTH)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Learning\n",
    "\n",
    "For N+1'st time, we use vanilla SARSA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "replay = pool.experience_replay",
    "\n",
    "\n",
    "qvalues_seq = agent.get_sessions(",
    "\n",
    "    replay,",
    "\n",
    "    session_length=SEQ_LENGTH,",
    "\n",
    "    experience_replay=True,",
    "\n",
    "    unroll_scan=False,  # this new guy makes compilation 100x faster for a bit slower runtime",
    "\n",
    ")[-1]",
    "\n",
    "\n",
    "auto_updates = agent.get_automatic_updates()  # required if unroll_scan=False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get SARSA mse loss",
    "\n",
    "from agentnet.learning import sarsa",
    "\n",
    "elemwise_mse = sarsa.get_elementwise_objective(qvalues_seq,",
    "\n",
    "                                               actions=replay.actions[0],",
    "\n",
    "                                               rewards=replay.rewards,",
    "\n",
    "                                               is_alive=replay.is_alive)",
    "\n",
    "loss = elemwise_mse.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute weights and updates",
    "\n",
    "weights = lasagne.layers.get_all_params([l_actions], trainable=True)",
    "\n",
    "\n",
    "updates = lasagne.updates.adam(loss, weights)",
    "\n",
    "\n",
    "# compile train function",
    "\n",
    "train_step = theano.function([], loss, updates=auto_updates+updates)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Demo run"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "untrained_reward = np.mean(pool.evaluate(save_path=\"./records\", n_games=10,",
    "\n",
    "                                         record_video=False, use_monitor=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Training loop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# starting epoch",
    "\n",
    "epoch_counter = 1",
    "\n",
    "\n",
    "# full game rewards",
    "\n",
    "rewards = {0: untrained_reward}",
    "\n",
    "loss, reward = 0, untrained_reward"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import trange",
    "\n",
    "from IPython.display import clear_output",
    "\n",
    "\n",
    "for i in trange(10000):",
    "\n",
    "\n",
    "    # play",
    "\n",
    "    pool.update(SEQ_LENGTH)",
    "\n",
    "    # train",
    "\n",
    "    loss = train_step()",
    "\n",
    "\n",
    "    # update epsilon",
    "\n",
    "    new_epsilon = max(0.01, 1-2e-4*epoch_counter)",
    "\n",
    "    l_actions.epsilon.set_value(np.float32(new_epsilon))",
    "\n",
    "\n",
    "    # record current learning progress and show learning curves",
    "\n",
    "    if epoch_counter % 100 == 0:",
    "\n",
    "        clear_output(True)",
    "\n",
    "        print(\"iter=%i,loss=%.3f,epsilon=%.3f\" %",
    "\n",
    "              (epoch_counter, loss, new_epsilon))",
    "\n",
    "        reward = 0.9*reward + 0.1*np.mean(np.mean(pool.evaluate(save_path=\"./records\", n_games=10,",
    "\n",
    "                                                                record_video=False, use_monitor=False)))",
    "\n",
    "        rewards[epoch_counter] = reward",
    "\n",
    "\n",
    "        plt.plot(*zip(*sorted(rewards.items(), key=lambda (t, r): t)))",
    "\n",
    "        plt.grid()",
    "\n",
    "        plt.show()",
    "\n",
    "\n",
    "    epoch_counter += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluating results\n",
    " * Here we plot learning curves and sample testimonials"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(*zip(*sorted(rewards.items(), key=lambda k: k[0])))",
    "\n",
    "plt.grid()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bonus (1++ points)\n",
    "\n",
    "Compare two types of nonlinearities for the RNN:\n",
    "- `T.nnet.elu`\n",
    "- `T.nnet.sigmoid`\n",
    "\n",
    "Re-train agent at least 10 times. It's probably a good idea to automate the process.\n",
    "\n",
    "Notice something weird? Any clue why this happens and how to fix it?\n",
    "\n",
    "_Running the experiment and reporting results gets your 1 point. Reward will get much higher as you go down the rabbit hole! Don't forget to send this notebook to anytask and mention that you went for this bonus._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# results, ideas, solutions..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "\n",
    "\n",
    "```\n",
    "```\n",
    "\n",
    "\n",
    "```\n",
    "```\n",
    "\n",
    "\n",
    "```\n",
    "```\n",
    "\n",
    "\n",
    "```\n",
    "```\n",
    "\n",
    "\n",
    "```\n",
    "```\n",
    "\n",
    "\n",
    "```\n",
    "```\n",
    "\n",
    "\n",
    "```\n",
    "```\n",
    "\n",
    "\n",
    "```\n",
    "```\n",
    "\n",
    "\n",
    "```\n",
    "```\n",
    "\n",
    "\n",
    "```\n",
    "```\n",
    "\n",
    "\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
